- GetAllHTML v0.64ß Copyright 1998 Christopher S Handley
- ======================================================
- Latest News
- -----------
- A fix I did in v0.61 was actually wrong - undone so that all downloading should
- work properly now. Also improved BROKENLINKS & other minor things. I actually
- had time to test this release, so it should work pretty well! :-)
-
- Many people have been having problems with GetAllHTML after editing it - seems
- this is due to spurious ASCII-27 characters mucking-up some editors :-( .
- Anyway, I wrote a program to detect & remove all non-visible characters
- (available if wanted), and it seems that GetAllHTML is the only recent text file
- I wrote which had the problem... Any ideas WHY they appeared? I use CygnusEd
- v3.5.
-
- I've programmed the BROKENLINKS switch to allow web page makers to automagically
- search their site for broken links - written just for Alexander Niven-Jenkins
- (emailing me can be worth it;-)
-
- Changed the NOPAUSE switch to PAUSE, so that it defaults to NOT pausing.
-
- Very minor enhancements & fixed an argument-interpreting bug.
-
- I will still fix major bugs until I have an AmigaE version that can be tested.
-
- Introduction
- ------------
- Have you ever visited a cool web site & wanted to keep a copy of some/all of it,
- but it would take ages to find & download all the respective pages/files?
-
- This is the answer!
-
- You supply this ARexx script with the start page URL, and a destination
- directory (which should be empty), and maybe a few other options - and off it
- goes! Note that it needs HTTPResume v1.3+ to work (get from Aminet).
-
- The idea for this came from a PC Java program called PageSucker - sadly it is
- over 1Mb in size & buggy (& can't be run on the Amiga, yet!). Although my
- implementation may not have quite as many features, it does do the job quite
- fast, & has fairly low memory overheads.
-
- Requirements
- ------------
- o HTTPResume v1.3+
-
- o An Amiga capable of running ARexx programs
-
- o Libs:Rexxsupport.library
-
- o Modem with TCP/IP stack (like Genesis, Miami, or AmiTCP)
-
- Usage
- -----
- 1.Before running it for the first time you must (text) edit it (say using C:ED)
- so that it knows where your copy of HTTPResume is located. Go to line 19 where it
- says:
- HTTPResume='Programs:Utils/Comms/HTTPResume'
- Alter the file path between the 'quotes' to where you keep HTTPResume, and save.
-
- 2.Run your TCP/IP stack (e.g.AmiTCP/Genesis or Miami).
-
- 3.Run it from a Shell using:
-
- Sys:RexxC/Rx GetAllHTML arguments
-
- Where the arguments are:
-
- "URL"/A, "DestDir"/A, NOASK/S, ARC/S, PIC/S, RESUME/S, PAUSE/S, DEPTH=/N/K, PORT=/K,
- BASEURL=/K, BROKENLINKS/S
-
- Note - The destination dir must be empty of any previous attempt at downloading
- that web page (unless you are using the RESUME switch). Both URL & DestDir
- *must* be enclosed in "double quotes" - and the BASEURL must NOT be
- surrounded by quotes!
-
- *Note* you may have several GetAllHTMLs & HTTPResumes running at the same time
- (not on the same URL!), and that if you use the PORT argument then you will need
- HTTPResume running first.
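-
- For example (this URL & destination directory are made up):
-
-     Sys:RexxC/Rx GetAllHTML "http://www.somesite.com/pics/" "DH1:Downloads/Pics" PIC NOASK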
-
- See the file GetAllHTML_ex.script for an example usage - it will download all of
- Squid's on-line artwork (hope he gets a few more sales of his wonderful 'other
- world' artwork from this :-).
-
- Behaviour
- ---------
- Its default behaviour is to find all links in each HTML page, and download them
- if:
-
- -the URL path is a sub-directory of the original URL path; this stops
- downloading irrelevant pages on different topics, different servers, etc.., AND
-
- -if they are HTML pages (name ends in .html, etc), OR
-
- -if they are not HTML pages, then it will ask whether the file should be
- downloaded; if the answer does not begin with an "n" then it is downloaded. If
- the file has the same suffix as the last positively confirmed download, then it
- intelligently assumes it should be downloaded (see the sketch below)...
-
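- As a rough ARexx-style sketch (this is NOT the script's actual code - IsHTML(),
- AskUser() & Download() are made-up helper names, and baseurl/suffix/lastsuffix
- are assumed variables), the default decision for each link looks something like:
-
-    /* inside the recursive routine that handles one link */
-    IF Left(url, Length(baseurl)) ~= baseurl THEN
-       RETURN                               /* outside the original URL path     */
-    IF IsHTML(url) THEN
-       CALL Download url                    /* download it & scan for more links */
-    ELSE IF suffix = lastsuffix THEN
-       CALL Download url                    /* same type as last confirmed file  */
-    ELSE IF Upper(Left(AskUser(url), 1)) ~= 'N' THEN
-       CALL Download url                    /* any answer not starting with "n"  */
-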
- This behaviour is modified by various switches:
-
- RESUME - Should downloading of pages have been interrupted (maybe a crash), run
- GetAllHTML with *exactly* the same options, except with this switch too.
- It will take a while to reach the same place as all previous HTML pages
- must be scanned - and some additional memory usage is incurred.
-
- I suggest you don't go on-line until it has reached the previously
- interrupted point (it waits for you to press return).
-
- *NOTE* that this mode is flawed due to the way GetAllHTML.rexx works,
- so that it will sometimes think it has reached the previously finished
- point, when in fact it has not. Still, RESUME is very useful! And an
- AmigaE version would fix this.
-
- PIC - Will identify links to pictures & download them rather than ask.
-
- ARC - Will identify links to archives & download them rather than ask.
-
- NOASK - Do not ask user if should download a file, assume should not.
-
- PAUSE - DO ask the user to "press <return>" if we get an empty URL or couldn't
- download a file. The RESUME function always asks this anyway.
-
- TERSE - Only outputs very important text, so it won't report strange URLs,
- failure to download files, non-http links, etc...
-
- PORT - Supplies the ARexx port of a *running* HTTPResume; if supplied then it
- does not try to launch HTTPResume from AmigaDOS. See *Note* below. If
- no port name is supplied then it sees if the current ARexx port is
- already set to some HTTPResume - if not then an error is generated, else
- it just uses that port.
-
- DEPTH - This allows you to specify how many URL links to follow in sequence;
- i.e.the depth of the search. DEPTH=1 means download only the links from
- the original page, and so on for DEPTH=2, etc.. See *Note* below. If
- no number is supplied then the user is asked for a number.
-
- BASEURL - This allows you to override the semi-intelligent default, and tell
- GetAllHTML the base URL - that is, what the URL must start with for it
- to even consider downloading it. This is useful if you wish to download
- from a particular page deep down the directory structure, but which
- references images (or maybe pages) that are further up.
-
- BROKENLINKS - This causes attempted downloading of pages that are not
- sub-directories of the original URL. These will be downloaded to "T:"
- (which is usually "RAM:T") & then deleted. If the download fails then you
- will be told that there was a broken link. See suggested uses below.
-
- *Note* that both DEPTH & PORT must be followed by an equals sign ("=") and then
- the data, _without_ any spaces between anything. This is due to a limitation of
- ARexx, which an AmigaE version would fix.
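-
- For example, these forms are fine (the port name here is invented - use whatever
- name your running HTTPResume actually reports):
-
-    Sys:RexxC/Rx GetAllHTML "http://www.somesite.com/docs/" "DH1:Docs" DEPTH=3
-    Sys:RexxC/Rx GetAllHTML "http://www.somesite.com/docs/" "DH1:Docs" PORT=HTTPRESUME.1
-
- whereas DEPTH = 3 (with spaces) would not be understood.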
-
- Suggested uses
- --------------
- 1.There's a big web site with lots of pictures/music/archives/information that
- interests you. Depending on what you want, you will need to use the PIC, ARC,
- and NOASK switches.
-
- For instance, if you are only interested in pictures then use PIC & NOASK. If
- you are only interested in archives then use ARC & NOASK. If you are interested
- in something other than pictures or archives (in addition to the web pages), then
- don't use any of those three switches - GetAllHTML will ask you if something
- should be downloaded.
-
-
- 2.You have your own home-page on the web, and it includes lots of links to other
- sites, which take hours to check are all still valid. Point GetAllHTML at
- your web site with the BROKENLINKS switch (see the example command below). Note
- that it will never try to download a link twice, even withOUT using RESUME.
-
- In fact, if you have your web site in a directory on your HD, then you could
- RESUME with that directory as your download directory; this will be MUCH faster
- since none of your pages will (or should) be downloaded :-)) . The first time you
- try this, do it on a back-up copy to ensure GetAllHTML does not do anything
- strange (I won't be held responsible for extra files 'magically' appearing!)..
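-
- For example (the URL & download directory are made up):
-
-    Sys:RexxC/Rx GetAllHTML "http://www.mysite.com/" "Work:WWW-Check" BROKENLINKS NOASK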
-
- 3.If you have a favourite news page then you can use GetAllHTML to download just
- the latest news by using RESUME. I suggest using NOASK, possibly with PIC if you
- want pictures too. Cool eh?
-
-
- Any other ideas?
-
- Bugs & other unwelcome features
- -------------------------------
- o The RESUME feature could in theory cause some files to be missed, with a VERY
- unlikely combination of file and/or directory names - it was decided to do this
- to obtain a large speed-up over a fool-proof method. An AmigaE version would
- fix this.
-
- o Interpretation of the HTML & URLs is based on observation rather than any
- specification of these standards - thus there will probably be rare cases in
- which it may misinterpret them; as long as these are reported (along with the
- responsible HTML file(s)), fixes will probably be forthcoming.
-
- o You cannot go above a depth of 42; this is to protect against an ARexx
- limitation which will cause problems above a depth of about 45. An AmigaE
- version would fix this.
-
- Technical info
- --------------
- GetAllHTML uses a depth-first tree search, via recursion, and uses the existence of
- (downloaded) files as a super-fast record of whether a page has been visited or
- not.
-
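- As an illustration (not the script's actual code - LocalName() & DownloadAndScan()
- are made-up helpers), the check is essentially:
-
-    file = LocalName(url)               /* where this URL would be saved to       */
-    IF Exists(file) THEN
-       RETURN                           /* already downloaded, so already visited */
-    ELSE
-       CALL DownloadAndScan url
-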
- When RESUMEing, existence of files cannot be used to record if a page has been
- visited, so an alternate method is used - this is slower, and could fail with
- certain combinations of strangely named URLs (very unlikely); a far slower
- method would avoid this, but was considered unnecessary.
-
- I used the INTERPRET command to do some magic with ARexx to make the arbitrarily
- long linked-lists (really branches) possible - they were required for storing
- what pages have been visited. Although this method is not very memory efficient
- (many duplicate entries of the same URL), it is quite fast - and more
- importantly it *works* in ARexx. I had thought it would be virtually impossible
- to make arbitrarily extended link-lists in ARexx, but the interpretive nature of
- ARexx means you can effectively create ARexx commands on-the-fly.
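-
- A tiny sketch of the idea (purely illustrative - the VISITED names & layout are
- made up, not taken from the script):
-
-    depth = 2
-    url = 'http://www.somesite.com/page.html'
-    listname = 'VISITED'depth                     /* builds the name VISITED2 */
-    INTERPRET listname'.0 = 0'                    /* start an empty list      */
-    INTERPRET listname'.0 = 'listname'.0 + 1'     /* append: bump the count   */
-    INTERPRET 'n = 'listname'.0'
-    INTERPRET listname'.n = url'                  /* i.e. VISITED2.1 = url    */
-
- Each INTERPRET line builds a string & then executes it as if it were an ordinary
- ARexx statement, so the variable names only need to exist at run-time.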
-
- Future
- ------
- Little development of the *ARexx* version of GetAllHTML is likely - my efforts
- are now on the super-fast AmigaE version: This will vastly speed-up RESUMEing,
- use far less memory & CPU, and may even give substantial speed-ups to
- normal operations as well (as the time between downloading files will be reduced
- to almost nothing). I hope it will eventually support multiple downloads at
- once, to minimise the total download time.
-
- For those interested (& as an advert!), since finishing v0.51ß I have been
- working on 2 (Object Orientated) modules for AmigaE which will make writing a
- conversion of GetAllHTML very easy: ARexxComm (for stupidly easy ARexx
- communication), and CString ('crashless' & very easy string handling - E's
- built-in string functions make string handling VERY much harder than in say ARexx or
- Basic).
-
- ARexxComm is fully working, and is on the Aminet. CString is NOT yet finished,
- but it may be soon - it may get an Aminet release too.
-
- As a side note, once CString is finished, I will start work on the E version of
- GetAllHTML - BUT, I also have 2 pet projects that will use up much of my
- programming time too... Emails prompting me to write GetAllHTML may get it
- written a bit quicker ^_^ though a 32Mb SIMM would do wonders ;-)
-
- BTW, I could release my first AmigaE program - a simple DIF program; there was
- no fast DIF program that was suited to AmigaDOS scripts (so I wrote it:).
-
- About the Author
- ----------------
- My name is Christopher S Handley, aged 21; my (permanent) email address is
- Chris.S.Handley@BTInternet.com. I have got a 2.1 BEng(Hons) in Electronic
- Engineering (specialising in digital system & chip design) after 3 years at
- Sheffield University, and am currently doing a 1 year MSc in the same area. I'm
- also looking for possible jobs - doubt I'll get any offers cos of GetAllHTML
- though!
-
- In no particular order my interests include Science Fiction (mostly hard SF
- books, esp.robot stories), Japanese Anime (ones with believable characters &
- decent plots, or at least very funny, and definitely *not* the violent "Manga"
- type of Anime videos), the Amiga since Xmas 1990 (even after using Macs & PCs a
- lot, I find it the nicest computer to use), programming (mostly in AmigaE,
- ARexx, AmigaDOS, and occasionally C & VHDL), music (I couldn't live without
- it!), generally thinking up crazy theories & cool algorithms, learning Japanese
- (slowly), cunning things you can do in digital design (esp.multi-processors
- ATM), reading 'true' Manga (Japanese comics), cross country running (not that
- I've done much lately), and that's probably enough for now!
-
- Contacting the Author
- ---------------------
- Email: Chris.S.Handley@BTInternet.com
-
- If I cannot be reached by that email address, in an emergency I may be reached
- at Chris.S.Handley@eva01.freeserve.co.uk but I usually only check this maybe
- once a month!
-
- I am not yet sure about giving my snail mail address to all & sundry - sorry
- about that :-( (I know how I felt when people did that before I had email access).
-
- Thanks to
- ---------
- o Andrija Antonijevic for HTTPResume
-
- o the Amiga community for sticking with the Amiga, and continuing to innovate.
- Give your backing to KOSH (originally proposed by Fleecy Moss).
-
- o CU Amiga for becoming the greatest Amiga mag over the last year, before
- passing away. I did not like AF's Xmas issue at all (and AF didn't appear to
- like my criticism of it either...)
-
- o whoever designed the Rexx language - Rexx is great for 'user utilities'.
-
- History
- -------
- v0.64ß (04-04-99) - Put back the 'extra' END that I removed in v0.61. Now
- BROKENLINKS will always only try to download external links
- once. Removed NOENV argument of HTTPResume so proxy
- settings may work. Minor changes.
- v0.63ß (04-04-99) - Removed spurious non-visible ASCII (27) characters that
- caused some text editors to go loopy.
- v0.62ß (03-04-99) - Add the BROKENLINKS switch. Replaced NOPAUSE by PAUSE
- switch. Now always warns if a file could not be downloaded
- (not just pages). If you used all the arguments then it
- would miss the last one.
- v0.61ß (28-03-99) - Possible fix for RESUME problem done, plus stupidly left an
- extra END where it broke GetAllHTML.
- v0.60ß (27-03-99) - First stand-alone Aminet release. Damn! There were 3 big
- mistakes... (a)some files expected as directories,
- (b)local-path expansion was too complex & probably wrong
- (hope right now), (c)implicit InDeX.hTmL files were not
- scanned for files. Also asked user to press return but
- really wanted a key first!
- v0.55ß (14-12-98) - Damn! All this fast programming has introduced some bugs,
- but they are fixed now; including that the "~" interpretation was
- completely wrong (removed), and fixed a long standing bug
- where a URL beginning with a slash was mis-understood. Also
- added the BASEURL feature which is really useful sometimes.
- v0.54ß (12-12-98) - Given I couldn't download the KOSH pages (www.kosh.net), I
- added basic frame support, and fixed a long standing bug
- where root html pages could appear as empty directories!
- Two more long standing bugs fixed (ARC & PIC switches had
- inverted sense). Add fix for paths with "~" in, so will
- align path correctly. Add semi-intelligence so that won't
- ask about downloading a file with the same suffix as the
- last file that was confirmed. Add the TERSE function.
- v0.53ß (10-12-98) - The DEPTH feature now works, added the PORT feature,
- added the NOPAUSE feature. Fixed long standing bug of NOASK
- not being recognised. Now removes text in URL after "?"s.
- v0.52ß ( 8-12-98) - Basically updated documentation ready for its first Aminet
- release, for when packaged along with HTTPResume. Added an
- untested DEPTH feature in the special v0.52aß.
- v0.51ß (??-??-98) - Internal speed-up (may be much faster downloading small pages)
- - minor fix of arguments given to HTTPResume
- v0.5ß (??-??-98) - Initial release to a few people. No bugs, honest =:)
-